Tunable Word-Level Index Compression for Versioned Corpora
Authors
Abstract
This paper presents a tunable index compression scheme for supporting time-travel phrase queries over large versioned corpora such as web archives. Phrase queries require the index to maintain word positions, which increases the index size significantly. We propose to fuse the word positions in many neighboring versions of a document, exploiting the typically high level of redundancy and compressibility to shrink the index. The resulting compression scheme, called FUSION, can be tuned to trade off compression against query-processing overhead. Our experiments on the revision history of Wikipedia demonstrate the effectiveness of our method.
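To make the idea of fusing word positions across neighboring versions concrete, the following is a minimal toy sketch. It is not the FUSION algorithm from the paper: the greedy grouping rule, the function names `term_positions` and `fuse_positions`, and the `max_overhead` parameter are all assumptions introduced here for illustration. Consecutive versions are grouped, and each group stores one fused (union) position list for a term. The group keeps growing while the fused list stays within `(1 + max_overhead)` times the shortest individual list in the group.

```python
from typing import List, Set, Tuple

# (first_version, last_version, fused positions); a hypothetical layout
Group = Tuple[int, int, List[int]]


def term_positions(tokens: List[str], term: str) -> Set[int]:
    """Word positions of `term` within a single document version."""
    return {i for i, tok in enumerate(tokens) if tok == term}


def fuse_positions(versions: List[List[str]], term: str,
                   max_overhead: float) -> List[Group]:
    """Greedily fuse the position lists of consecutive versions.

    Assumed rule (not from the paper): a version joins the current group
    only if the union of positions is at most (1 + max_overhead) times
    the shortest individual position list in the group.
    """
    groups: List[Group] = []
    start = 0
    fused: Set[int] = set()
    min_len = 0
    for v, tokens in enumerate(versions):
        pos = term_positions(tokens, term)
        if v == start:
            # First version of a new group.
            fused, min_len = set(pos), len(pos)
            continue
        candidate = fused | pos
        new_min = min(min_len, len(pos))
        if len(candidate) <= (1.0 + max_overhead) * new_min:
            fused, min_len = candidate, new_min
        else:
            # Close the current group and start a new one at version v.
            groups.append((start, v - 1, sorted(fused)))
            start, fused, min_len = v, set(pos), len(pos)
    if versions:
        groups.append((start, len(versions) - 1, sorted(fused)))
    return groups


versions = [
    "a b c a".split(),
    "a b c a d".split(),
    "x a b c a".split(),
]
strict = fuse_positions(versions, "a", max_overhead=0.0)
loose = fuse_positions(versions, "a", max_overhead=1.0)
```

In this toy setting a larger `max_overhead` yields fewer, longer fused lists. Because a fused list is a superset of each version's true positions, a phrase match found in it would still have to be verified against the specific version, which is one plausible way to picture the query-processing overhead the abstract mentions.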
Similar resources
A Dictionary-Based Multi-Corpora Text Compression System
In this paper we introduce StarZip, a multi-corpora text compression system, together with its transform engine StarNT. StarNT achieves a better compression ratio than almost all other recent efforts based on BWT and PPM. StarNT is a fast, dictionary-based, lossless text transform. The main idea is to recode each English word with a representation of no more than three symbols. This transfo...
Vocabulary Lists for EAP and Conversation Students
Despite the abundance of research investigating general and academic vocabularies and developing dozens of word lists, few studies have compared academic vocabulary with general service word lists such as conversation vocabulary. Many EAP researchers assume that university students need to know all the words in West’s (1953) General Service List (GSL) as a prerequisite to academic words (e.g., ...
The Effect of Pruning and Compression on Graphical Representations of the Output of a Speech Recognizer
Large vocabulary continuous speech recognition can benefit from an efficient data structure for representing a large number of acoustic hypotheses compactly. Word graphs or lattices have been chosen as such an efficient interface between acoustic recognition engines and subsequent language processing modules. This paper first investigates the effect of pruning during acoustic decoding on the quality of...
Self-Indexing Based on LZ77
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that sou...
The effect of pruning and compression on graphical representations of the output of a speech recognizer
Large vocabulary continuous speech recognition can benefit from an efficient data structure for representing a large number of acoustic hypotheses compactly. Word graphs or lattices have been chosen as such an efficient interface between acoustic recognition engines and subsequent language processing modules. This paper first investigates the effect of pruning during acoustic decoding on the qu...
Publication date: 2008